
    AmalgamScope: merging annotations data across the human genome

    Recent years have seen enormous advances in sequencing and array-based technologies, producing supplementary or alternative views of the genome that are stored in various formats and databases. Their sheer volume and differing scope make it challenging to jointly visualize and integrate the diverse data types. We present AmalgamScope, a new interactive software tool that assists scientists with the annotation of the human genome and, in particular, with the integration of annotation files from multiple data types, using gene identifiers and genomic coordinates. Supported platforms include next-generation sequencing and microarray technologies. The features of AmalgamScope range from annotating diverse data types across the human genome, to integrating the data based on the annotation information, to visualizing the merged files within chromosomal regions or across the whole genome. Additionally, users can define custom transcriptome library files for any species and use the tool's remote file-exchange server options.
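
    A minimal sketch of the kind of annotation merging described above, assuming the data have already been exported to tabular form: two hypothetical per-platform annotation tables are joined on a shared gene identifier with pandas. Column names and values are illustrative placeholders, not AmalgamScope's actual file format.

        import pandas as pd

        # Hypothetical annotation table from a next-generation sequencing pipeline
        seq_ann = pd.DataFrame({
            "gene_id": ["BRCA1", "TP53", "EGFR"],
            "chrom": ["chr17", "chr17", "chr7"],
            "start": [43044295, 7668402, 55019017],
            "end": [43125483, 7687550, 55211628],
            "coverage": [120, 95, 210],
        })

        # Hypothetical annotation table from a microarray platform
        array_ann = pd.DataFrame({
            "gene_id": ["BRCA1", "TP53", "MYC"],
            "probe_id": ["A_23_P1", "A_23_P2", "A_23_P3"],
            "log2_expr": [2.3, -1.1, 0.8],
        })

        # Merge on the shared gene identifier; an outer join keeps genes present
        # in only one platform so nothing is silently dropped.
        merged = seq_ann.merge(array_ann, on="gene_id", how="outer")
        print(merged)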

    Protein signatures using electrostatic molecular surfaces in harmonic space

    We developed a novel method based on the Fourier analysis of protein molecular surfaces to speed up the analysis of the vast structural data generated in the post-genomic era. This method computes the power spectrum of surfaces of the molecular electrostatic potential, whose three-dimensional coordinates have been either experimentally or theoretically determined. Thus we achieve a reduction of the initial three-dimensional information on the molecular surface to the one-dimensional information on pairs of points at a fixed scale apart. Consequently, the similarity search in our method is computationally less demanding and significantly faster than shape comparison methods. As proof of principle, we applied our method to a training set of viral proteins that are involved in major diseases such as Hepatitis C, Dengue fever, Yellow fever, Bovine viral diarrhea and West Nile fever. The training set contains proteins of four different protein families, as well as a mammalian representative enzyme. We found that the power spectrum successfully assigns a unique signature to each protein included in our training set, thus providing a direct probe of functional similarity among proteins. The results agree with established biological data from conventional structural biochemistry analyses.
    Comment: 9 pages, 10 figures. Published in PeerJ (2013), https://peerj.com/articles/185
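
    A simplified sketch of the signature idea, assuming the electrostatic potential has already been sampled as a one-dimensional sequence of values over the surface: the power spectrum serves as the signature, and two proteins are compared by the distance between their normalised spectra. The sampling scheme and data below are illustrative placeholders, not the authors' actual surface parameterisation.

        import numpy as np

        def power_spectrum(potential_samples):
            # Power spectrum of a 1D sequence of surface potential values
            spectrum = np.fft.rfft(potential_samples)
            return np.abs(spectrum) ** 2

        def spectral_distance(sig_a, sig_b):
            # Euclidean distance between two normalised spectral signatures
            a = sig_a / np.linalg.norm(sig_a)
            b = sig_b / np.linalg.norm(sig_b)
            return np.linalg.norm(a - b)

        # Two synthetic "potential profiles" standing in for real surface data
        rng = np.random.default_rng(0)
        protein_a = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * rng.normal(size=256)
        protein_b = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * rng.normal(size=256)

        print(spectral_distance(power_spectrum(protein_a), power_spectrum(protein_b)))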

    3D structural analysis of proteins using electrostatic surfaces based on image segmentation

    Herein, we present a novel strategy to analyse and characterize proteins using their molecular electrostatic surfaces. Our approach starts by calculating a series of distinct molecular surfaces for each protein, which are subsequently flattened out, thus reducing 3D information noise. The resulting RGB images are appropriately scaled by means of standard image processing techniques whilst retaining the weight information of each protein's molecular electrostatic surface. Homogeneous areas in the protein surface are then estimated by unsupervised clustering of the 3D images while performing similarity searches. This is a computationally fast approach which efficiently highlights interesting structural areas among a group of proteins. Multiple protein electrostatic surfaces can be combined and, in conjunction with their processed images, provide the starting material for protein structural similarity and molecular docking experiments.
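
    A minimal sketch of the unsupervised-clustering step, assuming the flattened electrostatic surface is available as an RGB image: pixels are clustered with k-means so that each cluster corresponds to a homogeneous area. The image here is synthetic; in practice it would be the flattened, rescaled surface map described above.

        import numpy as np
        from sklearn.cluster import KMeans

        # Synthetic 64x64 RGB "surface" image with values in [0, 1]
        rng = np.random.default_rng(0)
        image = rng.random((64, 64, 3))

        # Reshape to a list of pixels and cluster into k homogeneous regions
        pixels = image.reshape(-1, 3)
        labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pixels)

        # Back to image shape: each pixel now carries its region label
        segmentation = labels.reshape(64, 64)
        print(np.bincount(labels))  # number of pixels per segment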

    Outcome prediction based on microarray analysis: a critical perspective on methods

    Background: Information extraction from microarrays has not yet been widely used in diagnostic or prognostic decision-support systems, due to the diversity of results produced by the available techniques, their instability on different data sets and the inability to relate statistical significance with biological relevance. Thus, there is an urgent need to address the statistical framework of microarray analysis and identify its drawbacks and limitations, which will enable us to thoroughly compare methodologies under the same experimental set-up and associate results with confidence intervals meaningful to clinicians. In this study we consider gene-selection algorithms with the aim to reveal inefficiencies in performance evaluation and address aspects that can reduce uncertainty in algorithmic validation. Results: A computational study is performed on the performance of several gene selection methodologies on publicly available microarray data. Three basic types of experimental scenarios are evaluated, i.e. the independent test-set and the 10-fold cross-validation (CV) using maximum and average performance measures. Feature selection methods behave differently under different validation strategies. The performance results from CV do not match well those from the independent test-set, except for the support vector machines (SVM) and the least squares SVM methods. However, these wrapper methods achieve variable (often low) performance, whereas the hybrid methods attain consistently higher accuracies. The use of an independent test-set within CV is important for the evaluation of the predictive power of algorithms. The optimal size of the selected gene-set also appears to be dependent on the evaluation scheme. The consistency of selected genes over variation of the training-set is another aspect important in reducing uncertainty in the evaluation of the derived gene signature. In all cases the presence of outlier samples can seriously affect algorithmic performance. Conclusion: Multiple parameters can influence the selection of a gene-signature and its predictive power, thus possible biases in validation methods must always be accounted for. This paper illustrates that independent test-set evaluation reduces the bias of CV, and case-specific measures reveal stability characteristics of the gene-signature over changes of the training set. Moreover, frequency measures on gene selection address the algorithmic consistency in selecting the same gene signature under different training conditions. These issues contribute to the development of an objective evaluation framework and aid the derivation of statistically consistent gene signatures that could eventually be correlated with biological relevance. The benefits of the proposed framework are supported by the evaluation results and methodological comparisons performed for several gene-selection algorithms on three publicly available datasets.
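
    A hedged sketch of the two evaluation set-ups contrasted above, on synthetic stand-in data rather than the actual microarray sets: 10-fold CV on the training data, with gene selection kept inside the pipeline so it is re-fit in every fold, versus a held-out independent test set. The selector, classifier and data sizes are illustrative choices, not the specific algorithms studied in the paper.

        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import cross_val_score, train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        # Synthetic "microarray": 200 samples x 2000 genes, few informative genes
        X, y = make_classification(n_samples=200, n_features=2000, n_informative=20,
                                   random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                            random_state=0)

        # Gene selection is part of the pipeline, so it is repeated in every CV fold
        model = make_pipeline(SelectKBest(f_classif, k=50), SVC(kernel="linear"))

        cv_scores = cross_val_score(model, X_train, y_train, cv=10)
        model.fit(X_train, y_train)
        test_score = model.score(X_test, y_test)

        print(f"10-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
        print(f"Independent test-set accuracy: {test_score:.3f}")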

    On a meaningful integration of web services in data-intensive biomedical environments: The DICODE approach

    This paper reports on an innovative approach that aims to reduce information management costs in data-intensive and cognitively complex biomedical environments. Recognizing the importance of prominent high-performance computing paradigms, large-scale data processing technologies and collaboration support systems in remedying data-intensive issues, it adopts a hybrid approach that builds on the synergy of these technologies. The proposed approach provides innovative Web-based workbenches that integrate and orchestrate a set of interoperable services, reducing the data-intensiveness and complexity overload at critical decision points to a manageable level and thus permitting stakeholders to be more productive and to concentrate on creative activities.

    The eNanoMapper database for nanomaterial safety information

    Background: The NanoSafety Cluster, a cluster of projects funded by the European Commission, identified the need for a computational infrastructure for toxicological data management of engineered nanomaterials (ENMs). Ontologies, open standards, and interoperable designs were envisioned to empower a harmonized approach to European research in nanotechnology. This setting provides a number of opportunities and challenges in the representation of nanomaterials data and the integration of ENM information originating from diverse systems. Within this cluster, eNanoMapper works towards supporting the collaborative safety assessment for ENMs by creating a modular and extensible infrastructure for data sharing, data analysis, and building computational toxicology models for ENMs. Results: The eNanoMapper database solution builds on the previous experience of the consortium partners in supporting diverse data through flexible data storage, open source components and web services. We have recently described the design of the eNanoMapper prototype database along with a summary of challenges in the representation of ENM data and an extensive review of existing nano-related data models, databases, and nanomaterials-related entries in chemical and toxicogenomic databases. This paper continues with a focus on the database functionality exposed through its application programming interface (API), and its use in visualisation and modelling. Considering the preferred community practice of using spreadsheet templates, we developed a configurable spreadsheet parser facilitating user-friendly data preparation and data upload. We further present a web application able to retrieve the experimental data via the API and analyse it with multiple data preprocessing and machine learning algorithms. Conclusion: We demonstrate how the eNanoMapper database is used to import and publish online ENM and assay data from several data sources, how the “representational state transfer” (REST) API enables building user-friendly interfaces and graphical summaries of the data, and how these resources facilitate the modelling of reproducible quantitative structure–activity relationships for nanomaterials (NanoQSAR).
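
    An illustrative sketch of how a client might pull experimental data from a REST API of this kind and load it into a table for the preprocessing and modelling step. The endpoint URL, response layout and field names below are hypothetical placeholders, not the actual eNanoMapper API schema.

        import requests
        import pandas as pd

        BASE_URL = "https://example.org/enanomapper/api"  # placeholder, not the real host

        def fetch_substances(query: str) -> pd.DataFrame:
            # Fetch substance records matching a free-text query (assumed endpoint)
            resp = requests.get(f"{BASE_URL}/substances", params={"search": query},
                                timeout=30)
            resp.raise_for_status()
            records = resp.json().get("substances", [])  # assumed response key
            return pd.json_normalize(records)

        # Example (not executed here): df = fetch_substances("TiO2")
        # The resulting table would feed the preprocessing / machine-learning step.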

    How can we justify grouping of nanoforms for hazard assessment? Concepts and tools to quantify similarity

    The risk of each nanoform (NF) of the same substance cannot be assumed to be the same, as NFs may vary in their physicochemical characteristics, exposure and hazard. However, neither can we justify the additional animal testing and resources needed to test every NF individually. To reduce the need to test all NFs, (regulatory) information requirements may be fulfilled by grouping approaches. For such grouping to be acceptable, it is important to demonstrate similarities in physicochemical properties, toxicokinetic behaviour, and (eco)toxicological behaviour. The GRACIOUS Framework supports the grouping of NFs by identifying suitable grouping hypotheses that describe the key similarities between different NFs. The Framework then supports the user in gathering the evidence required to test these hypotheses and in subsequently assessing the similarity of the NFs within the proposed group. The evidence needed to support a hypothesis is gathered by an Integrated Approach to Testing and Assessment (IATA), designed as decision trees constructed of decision nodes. Each decision node asks the questions and provides the methods needed to obtain the most relevant information. This White Paper outlines existing and novel methods to assess the similarity of the data generated for each decision node, either via a pairwise analysis conducted property-by-property, or by assessing multiple decision nodes simultaneously via a multidimensional analysis. For the pairwise, property-by-property comparison, this White Paper covers:
    • A Bayesian model assessment which compares two sets of values using nested sampling; this approach is new in NF grouping.
    • An Arsinh-Ordered Weighted Average model (Arsinh-OWA), which applies the arsinh transformation to the distance between two NFs and then rescales the result to the arsinh of a biologically relevant threshold before grouping using an OWA-based distance; this approach is also new in NF grouping.
    • An x-fold comparison, as used in the ECETOC NanoApp.
    • Euclidean distance, a highly established distance metric.
    The x-fold, Bayesian and Arsinh-OWA distance algorithms performed comparably in scoring the similarity between NF pairs. The Euclidean distance was also useful, but only with proper data transformation. The x-fold method does not standardize the data, and thus produces skewed histograms, but has the advantage that it can be implemented without programming know-how. A range of multidimensional evaluations, using for example dendrogram clustering approaches, were also investigated. Multidimensional distance metrics were demonstrated to be difficult to use in a regulatory context, but from a scientific perspective were found to offer unexpected insights into the overall similarity of very different materials. In conclusion, for regulatory purposes a property-by-property evaluation of the data matrix is recommended to substantiate grouping, while the multidimensional approaches are considered to be tools of discovery rather than regulatory methods.
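
    A hedged sketch of two of the simpler pairwise measures named above, the x-fold comparison and the Euclidean distance over standardised property values; the Bayesian and Arsinh-OWA approaches are more involved and are not reproduced here. The property values are hypothetical examples, not data from the White Paper.

        import numpy as np

        def x_fold(value_a: float, value_b: float) -> float:
            # Fold difference between two positive property values (always >= 1)
            hi, lo = max(value_a, value_b), min(value_a, value_b)
            return hi / lo

        def euclidean_distance(props_a, props_b):
            # Euclidean distance between standardised property vectors of two NFs
            a, b = np.asarray(props_a, float), np.asarray(props_b, float)
            stacked = np.vstack([a, b])
            # Standardise each property so different scales are comparable
            z = (stacked - stacked.mean(axis=0)) / (stacked.std(axis=0) + 1e-12)
            return np.linalg.norm(z[0] - z[1])

        # Hypothetical property values (e.g. size in nm, dissolution rate) for two NFs
        nf1 = [25.0, 0.8]
        nf2 = [40.0, 1.1]
        print(x_fold(nf1[0], nf2[0]))          # fold difference in the first property
        print(euclidean_distance(nf1, nf2))    # overall distance across properties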

    Uncovering genomic block structure: Novel statistical approaches

    No full text
    EThOS - Electronic Theses Online Service, United Kingdom